{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Datashader is designed to make it simple to work with even very large\n",
    "datasets. To get good performance, it is essential that each step in the\n",
    "overall processing pipeline be set up appropriately. Below we share some\n",
    "of our suggestions based on our own [Benchmarking] and optimization\n",
    "experience, which should help you obtain suitable performance in your\n",
    "own work.\n",
    "\n",
    "File formats\n",
    "------------\n",
    "\n",
    "Based on our [testing with various file formats], we recommend storing\n",
    "any large columnar datasets in the [Apache Parquet] format when\n",
    "possible, using the [fastparquet] library with \"Snappy\" compression:\n",
    "\n",
    "```\n",
    ">>> import dask.dataframe as dd\n",
     ">>> dd.to_parquet(df, filename, engine=\"fastparquet\", compression=\"snappy\")\n",
    "```\n",
    "\n",
     "If your data includes categorical values that take on a limited, fixed\n",
     "number of possible values (e.g. \"Male\", \"Female\"), it is worth storing\n",
     "them as Parquet categorical columns, which use a more memory-efficient\n",
     "representation and are optimized for common operations such as sorting\n",
     "and finding unique values. Before saving, just convert each such column\n",
     "as follows:\n",
    "\n",
    "```\n",
    ">>> df[colname] = df[colname].astype('category')\n",
    "```\n",
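     "\n",
     "The savings can be verified directly; a minimal sketch, using a\n",
     "hypothetical `city` column of repeated strings:\n",
     "\n",
     "```\n",
     "import pandas as pd\n",
     "\n",
     "# A column of repeated string values standing in for real data\n",
     "df = pd.DataFrame({'city': ['NYC', 'LA', 'NYC', 'SF'] * 250000})\n",
     "\n",
     "before = df['city'].memory_usage(deep=True)\n",
     "df['city'] = df['city'].astype('category')\n",
     "after = df['city'].memory_usage(deep=True)\n",
     "\n",
     "# The categorical column stores each distinct string only once,\n",
     "# plus a small integer code per row\n",
     "print(before, after)\n",
     "```\n",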
    "\n",
    "By default, numerical datasets typically use 64-bit floats, but many\n",
    "applications do not require 64-bit precision when aggregating over a\n",
    "very large number of datapoints to show a distribution. Using 32-bit\n",
     "floats cuts storage and memory requirements in half, and also\n",
    "typically greatly speeds up computations because only half as much data\n",
    "needs to be accessed in memory. If applicable to your particular\n",
    "situation, just convert the data type before generating the file:\n",
    "\n",
    "```\n",
    ">>> df[colname] = df[colname].astype(numpy.float32)\n",
    "```\n",
    "\n",
    "Single machine\n",
    "--------------\n",
    "\n",
     "Datashader supports both Pandas and Dask dataframes, but Dask\n",
     "dataframes typically give higher performance even on a single machine,\n",
     "because Dask makes good use of all available cores and also supports\n",
     "out-of-core operation for datasets larger than memory.\n",
    "\n",
     "Dask works on chunks of the data, called partitions, one at a time.\n",
    "With dask on a single machine, a rule of thumb for the number of\n",
    "partitions to use is `multiprocessing.cpu_count()`, which allows Dask to\n",
    "use one thread per core for parallelizing computations.\n",
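     "\n",
     "A dataframe that arrived with a different chunking (e.g. from Parquet\n",
     "row groups) can be rechunked to match; a minimal sketch, using a small\n",
     "in-memory frame to stand in for real data:\n",
     "\n",
     "```\n",
     "import multiprocessing as mp\n",
     "import pandas as pd\n",
     "from dask import dataframe as dd\n",
     "\n",
     "# A small frame standing in for data loaded with arbitrary chunking\n",
     "dask_df = dd.from_pandas(pd.DataFrame({'x': range(10000)}), npartitions=2)\n",
     "\n",
     "# Rechunk so Dask can use one thread per core\n",
     "dask_df = dask_df.repartition(npartitions=mp.cpu_count())\n",
     "print(dask_df.npartitions)\n",
     "```\n",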
    "\n",
    "When the entire dataset fits into memory at once, you can persist the\n",
    "data as a Dask dataframe prior to passing it into datashader, to ensure\n",
    "that data only needs to be loaded once:\n",
    "\n",
    "```\n",
    ">>> from dask import dataframe as dd\n",
    ">>> import multiprocessing as mp\n",
    ">>> dask_df = dd.from_pandas(df, npartitions=mp.cpu_count())\n",
     ">>> dask_df = dask_df.persist()\n",
    "...\n",
    ">>> cvs = datashader.Canvas(...)\n",
    ">>> agg = cvs.points(dask_df, ...)\n",
    "```\n",
    "\n",
    "  [Benchmarking]: https://github.com/bokeh/datashader/issues/313\n",
    "  [testing with various file formats]: https://github.com/bokeh/datashader/issues/129\n",
    "  [Apache Parquet]: https://parquet.apache.org/\n",
    "  [fastparquet]: https://github.com/dask/fastparquet"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
